Table of Contents
Introduction¶
Welcome to our notebook analyzing mental health data. Here we will work through various techniques in data analysis and data science. The datasets were found by the author on Kaggle. Please contact me for further information at Jose.gj.chavez@gmail.com
!pwd
/Users/josechavez/WEB/jc_micro_projects/dataSnA/projectUno/notebooks

Importing the data set and a cursory examination¶
We import the necessary libraries (pandas, as you may expect) and read in our CSV data. Then we print the first rows using the head function with its default of 5 rows, and print the columns of the dataframe.
import pandas as pd
df_mental_health = pd.read_csv('prevalence-by-mental-and-substance-use-disorder.csv')
print(df_mental_health.head())
# Examine the columns
print(df_mental_health.columns)
Entity Code Year \
0 Afghanistan AFG 1990
1 Afghanistan AFG 1991
2 Afghanistan AFG 1992
3 Afghanistan AFG 1993
4 Afghanistan AFG 1994
Prevalence - Schizophrenia - Sex: Both - Age: Age-standardized (Percent) \
0 0.228979
1 0.228120
2 0.227328
3 0.226468
4 0.225567
Prevalence - Bipolar disorder - Sex: Both - Age: Age-standardized (Percent) \
0 0.721207
1 0.719952
2 0.718418
3 0.717452
4 0.717012
Prevalence - Eating disorders - Sex: Both - Age: Age-standardized (Percent) \
0 0.131001
1 0.126395
2 0.121832
3 0.117942
4 0.114547
Prevalence - Anxiety disorders - Sex: Both - Age: Age-standardized (Percent) \
0 4.835127
1 4.821765
2 4.801434
3 4.789363
4 4.784923
Prevalence - Drug use disorders - Sex: Both - Age: Age-standardized (Percent) \
0 0.454202
1 0.447112
2 0.441190
3 0.435581
4 0.431822
Prevalence - Depressive disorders - Sex: Both - Age: Age-standardized (Percent) \
0 5.125291
1 5.116306
2 5.106558
3 5.100328
4 5.099424
Prevalence - Alcohol use disorders - Sex: Both - Age: Age-standardized (Percent)
0 0.444036
1 0.444250
2 0.445501
3 0.445958
4 0.445779
Index(['Entity', 'Code', 'Year',
'Prevalence - Schizophrenia - Sex: Both - Age: Age-standardized (Percent)',
'Prevalence - Bipolar disorder - Sex: Both - Age: Age-standardized (Percent)',
'Prevalence - Eating disorders - Sex: Both - Age: Age-standardized (Percent)',
'Prevalence - Anxiety disorders - Sex: Both - Age: Age-standardized (Percent)',
'Prevalence - Drug use disorders - Sex: Both - Age: Age-standardized (Percent)',
'Prevalence - Depressive disorders - Sex: Both - Age: Age-standardized (Percent)',
'Prevalence - Alcohol use disorders - Sex: Both - Age: Age-standardized (Percent)'],
dtype='object')
As you can see, the dataframe is relatively straightforward. It consists of records for various countries over the years, along with the corresponding prevalence rates for an assortment of mental disorders.
Focusing on Schizophrenia¶
# Columns
#print(df_mental_health.columns)
df_new = df_mental_health[['Entity', 'Code', 'Year', 'Prevalence - Schizophrenia - Sex: Both - Age: Age-standardized (Percent)']]
df_new = df_new.rename(columns={'Prevalence - Schizophrenia - Sex: Both - Age: Age-standardized (Percent)': 'Schizo'})
print(df_new.head())
# Let's see what unique countries are present
unique_entities = df_new['Entity'].unique()
print(unique_entities)
#df_us = df_new[df_new['Entity'] == 'United States']
print(df_new)
Entity Code Year Schizo
0 Afghanistan AFG 1990 0.228979
1 Afghanistan AFG 1991 0.228120
2 Afghanistan AFG 1992 0.227328
3 Afghanistan AFG 1993 0.226468
4 Afghanistan AFG 1994 0.225567
['Afghanistan' 'African Region (WHO)' 'Albania' 'Algeria' 'American Samoa'
'Andorra' 'Angola' 'Antigua and Barbuda' 'Argentina' 'Armenia'
'Australia' 'Austria' 'Azerbaijan' 'Bahamas' 'Bahrain' 'Bangladesh'
'Barbados' 'Belarus' 'Belgium' 'Belize' 'Benin' 'Bermuda' 'Bhutan'
'Bolivia' 'Bosnia and Herzegovina' 'Botswana' 'Brazil' 'Brunei'
'Bulgaria' 'Burkina Faso' 'Burundi' 'Cambodia' 'Cameroon' 'Canada'
'Cape Verde' 'Central African Republic' 'Chad' 'Chile' 'China' 'Colombia'
'Comoros' 'Congo' 'Cook Islands' 'Costa Rica' "Cote d'Ivoire" 'Croatia'
'Cuba' 'Cyprus' 'Czechia' 'Democratic Republic of Congo' 'Denmark'
'Djibouti' 'Dominica' 'Dominican Republic' 'East Asia & Pacific (WB)'
'Eastern Mediterranean Region (WHO)' 'Ecuador' 'Egypt' 'El Salvador'
'England' 'Equatorial Guinea' 'Eritrea' 'Estonia' 'Eswatini' 'Ethiopia'
'Europe & Central Asia (WB)' 'European Region (WHO)' 'Fiji' 'Finland'
'France' 'G20' 'Gabon' 'Gambia' 'Georgia' 'Germany' 'Ghana' 'Greece'
'Greenland' 'Grenada' 'Guam' 'Guatemala' 'Guinea' 'Guinea-Bissau'
'Guyana' 'Haiti' 'Honduras' 'Hungary' 'Iceland' 'India' 'Indonesia'
'Iran' 'Iraq' 'Ireland' 'Israel' 'Italy' 'Jamaica' 'Japan' 'Jordan'
'Kazakhstan' 'Kenya' 'Kiribati' 'Kuwait' 'Kyrgyzstan' 'Laos'
'Latin America & Caribbean (WB)' 'Latvia' 'Lebanon' 'Lesotho' 'Liberia'
'Libya' 'Lithuania' 'Luxembourg' 'Madagascar' 'Malawi' 'Malaysia'
'Maldives' 'Mali' 'Malta' 'Marshall Islands' 'Mauritania' 'Mauritius'
'Mexico' 'Micronesia (country)' 'Middle East & North Africa (WB)'
'Moldova' 'Monaco' 'Mongolia' 'Montenegro' 'Morocco' 'Mozambique'
'Myanmar' 'Namibia' 'Nauru' 'Nepal' 'Netherlands' 'New Zealand'
'Nicaragua' 'Niger' 'Nigeria' 'Niue' 'North America (WB)' 'North Korea'
'North Macedonia' 'Northern Ireland' 'Northern Mariana Islands' 'Norway'
'OECD Countries' 'Oman' 'Pakistan' 'Palau' 'Palestine' 'Panama'
'Papua New Guinea' 'Paraguay' 'Peru' 'Philippines' 'Poland' 'Portugal'
'Puerto Rico' 'Qatar' 'Region of the Americas (WHO)' 'Romania' 'Russia'
'Rwanda' 'Saint Kitts and Nevis' 'Saint Lucia'
'Saint Vincent and the Grenadines' 'Samoa' 'San Marino'
'Sao Tome and Principe' 'Saudi Arabia' 'Scotland' 'Senegal' 'Serbia'
'Seychelles' 'Sierra Leone' 'Singapore' 'Slovakia' 'Slovenia'
'Solomon Islands' 'Somalia' 'South Africa' 'South Asia (WB)'
'South Korea' 'South Sudan' 'South-East Asia Region (WHO)' 'Spain'
'Sri Lanka' 'Sub-Saharan Africa (WB)' 'Sudan' 'Suriname' 'Sweden'
'Switzerland' 'Syria' 'Taiwan' 'Tajikistan' 'Tanzania' 'Thailand' 'Timor'
'Togo' 'Tokelau' 'Tonga' 'Trinidad and Tobago' 'Tunisia' 'Turkey'
'Turkmenistan' 'Tuvalu' 'Uganda' 'Ukraine' 'United Arab Emirates'
'United Kingdom' 'United States' 'United States Virgin Islands' 'Uruguay'
'Uzbekistan' 'Vanuatu' 'Venezuela' 'Vietnam' 'Wales'
'Western Pacific Region (WHO)' 'World' 'World Bank High Income'
'World Bank Low Income' 'World Bank Lower Middle Income'
'World Bank Upper Middle Income' 'Yemen' 'Zambia' 'Zimbabwe']
Entity Code Year Schizo
0 Afghanistan AFG 1990 0.228979
1 Afghanistan AFG 1991 0.228120
2 Afghanistan AFG 1992 0.227328
3 Afghanistan AFG 1993 0.226468
4 Afghanistan AFG 1994 0.225567
... ... ... ... ...
6835 Zimbabwe ZWE 2015 0.209359
6836 Zimbabwe ZWE 2016 0.209979
6837 Zimbabwe ZWE 2017 0.210631
6838 Zimbabwe ZWE 2018 0.211237
6839 Zimbabwe ZWE 2019 0.211969
[6840 rows x 4 columns]
Focusing on Schizophrenia, Alcoholism and Depression¶
Now we start exploring the data set with respect to depression, schizophrenia, and alcohol abuse. Notice that we use "\" to continue long lines (Python) and that we are creating a new dataframe from the column names of interest. The syntax for doing so is, roughly speaking: df[[<comma-separated list of strings for column names>]]
df_sad = df_mental_health[['Entity', 'Code', 'Year',\
'Prevalence - Schizophrenia - Sex: Both - Age: Age-standardized (Percent)',\
'Prevalence - Depressive disorders - Sex: Both - Age: Age-standardized (Percent)','Prevalence - Alcohol use disorders - Sex: Both - Age: Age-standardized (Percent)']]
print(df_sad.head())
#Prevalence - Depressive disorders - Sex: Both - Age: Age-standardized (Percent)',
# 'Prevalence - Alcohol use disorders - Sex: Both - Age: Age-standardized (Percent)'
Entity Code Year \
0 Afghanistan AFG 1990
1 Afghanistan AFG 1991
2 Afghanistan AFG 1992
3 Afghanistan AFG 1993
4 Afghanistan AFG 1994
Prevalence - Schizophrenia - Sex: Both - Age: Age-standardized (Percent) \
0 0.228979
1 0.228120
2 0.227328
3 0.226468
4 0.225567
Prevalence - Depressive disorders - Sex: Both - Age: Age-standardized (Percent) \
0 5.125291
1 5.116306
2 5.106558
3 5.100328
4 5.099424
Prevalence - Alcohol use disorders - Sex: Both - Age: Age-standardized (Percent)
0 0.444036
1 0.444250
2 0.445501
3 0.445958
4 0.445779
Aside on long column names:¶
Notice that the column names are a little unwieldy. Moreover, there are natural choices for shorter column names. We can use pandas methods to shorten the names while retaining the ability to print the dataframe and still know what the columns correspond to.
df_sad = df_sad.rename(columns={'Prevalence - Schizophrenia - Sex: Both - Age: Age-standardized (Percent)': 'Schizo'})
df_sad = df_sad.rename(columns={'Prevalence - Depressive disorders - Sex: Both - Age: Age-standardized (Percent)': 'Depp'})
df_sad = df_sad.rename(columns={'Prevalence - Alcohol use disorders - Sex: Both - Age: Age-standardized (Percent)': 'Alch'})
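As an aside, the three chained rename calls can be collapsed into one call with a single mapping dict. A minimal sketch of the same idea (df_demo is a hypothetical stand-in for df_sad, with made-up values):

```python
import pandas as pd

# A tiny frame standing in for df_sad (values are made up, for illustration only)
df_demo = pd.DataFrame({
    'Prevalence - Schizophrenia - Sex: Both - Age: Age-standardized (Percent)': [0.23],
    'Prevalence - Depressive disorders - Sex: Both - Age: Age-standardized (Percent)': [5.13],
    'Prevalence - Alcohol use disorders - Sex: Both - Age: Age-standardized (Percent)': [0.44],
})

# One rename call with one mapping dict replaces three chained calls
short_names = {
    'Prevalence - Schizophrenia - Sex: Both - Age: Age-standardized (Percent)': 'Schizo',
    'Prevalence - Depressive disorders - Sex: Both - Age: Age-standardized (Percent)': 'Depp',
    'Prevalence - Alcohol use disorders - Sex: Both - Age: Age-standardized (Percent)': 'Alch',
}
df_demo = df_demo.rename(columns=short_names)
print(list(df_demo.columns))  # ['Schizo', 'Depp', 'Alch']
```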
Subframe of United States Entries¶
grouped = df_sad.groupby('Entity')
print(grouped)
#df_us_sad=grouped
#print(df_us_sad)
#df_us_sad = df_sad[['Code', 'Year', 'Schizo']]
#df_united_inSad = df_sad[df_sad['Entity' == 'United States']]
df_united_inSad = df_sad[df_sad['Entity'] == 'United States']
#print(df_us_new)
#df_us_sad.to_pickle('df_us_new_schizo.pkl')
print(df_united_inSad.head())
import pandas as pd
from pandas.plotting import scatter_matrix
import numpy as np
# Assuming df is your DataFrame and it already exists
# df = pd.DataFrame(np.random.randn(1000, 4), columns=['a', 'b', 'c', 'd'])
scatter_matrix(df_united_inSad, alpha=0.2, figsize=(6, 6), diagonal='kde')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x103664250>
Entity Code Year Schizo Depp Alch
6330 United States USA 1990 0.467115 4.068695 3.111360
6331 United States USA 1991 0.472488 4.196610 3.022482
6332 United States USA 1992 0.477502 4.323224 2.937071
6333 United States USA 1993 0.481847 4.443956 2.858443
6334 United States USA 1994 0.485216 4.554388 2.790028
array([[<Axes: xlabel='Year', ylabel='Year'>,
<Axes: xlabel='Schizo', ylabel='Year'>,
<Axes: xlabel='Depp', ylabel='Year'>,
<Axes: xlabel='Alch', ylabel='Year'>],
[<Axes: xlabel='Year', ylabel='Schizo'>,
<Axes: xlabel='Schizo', ylabel='Schizo'>,
<Axes: xlabel='Depp', ylabel='Schizo'>,
<Axes: xlabel='Alch', ylabel='Schizo'>],
[<Axes: xlabel='Year', ylabel='Depp'>,
<Axes: xlabel='Schizo', ylabel='Depp'>,
<Axes: xlabel='Depp', ylabel='Depp'>,
<Axes: xlabel='Alch', ylabel='Depp'>],
[<Axes: xlabel='Year', ylabel='Alch'>,
<Axes: xlabel='Schizo', ylabel='Alch'>,
<Axes: xlabel='Depp', ylabel='Alch'>,
<Axes: xlabel='Alch', ylabel='Alch'>]], dtype=object)
pip install statsmodels
Looking in indexes: https://pypi.org/simple, https://packagecloud.io/pimutils/khal/pypi/simple
Requirement already satisfied: statsmodels in /opt/homebrew/lib/python3.11/site-packages (0.14.0)
Requirement already satisfied: numpy>=1.18 in /opt/homebrew/lib/python3.11/site-packages (from statsmodels) (1.25.0)
Requirement already satisfied: scipy!=1.9.2,>=1.4 in /opt/homebrew/lib/python3.11/site-packages (from statsmodels) (1.11.1)
Requirement already satisfied: pandas>=1.0 in /opt/homebrew/lib/python3.11/site-packages (from statsmodels) (2.0.3)
Requirement already satisfied: patsy>=0.5.2 in /opt/homebrew/lib/python3.11/site-packages (from statsmodels) (0.5.3)
Requirement already satisfied: packaging>=21.3 in /opt/homebrew/lib/python3.11/site-packages (from statsmodels) (23.1)
Requirement already satisfied: python-dateutil>=2.8.2 in /opt/homebrew/lib/python3.11/site-packages (from pandas>=1.0->statsmodels) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/homebrew/lib/python3.11/site-packages (from pandas>=1.0->statsmodels) (2023.3)
Requirement already satisfied: tzdata>=2022.1 in /opt/homebrew/lib/python3.11/site-packages (from pandas>=1.0->statsmodels) (2023.3)
Requirement already satisfied: six in /opt/homebrew/lib/python3.11/site-packages (from patsy>=0.5.2->statsmodels) (1.16.0)
[notice] A new release of pip is available: 23.0.1 -> 23.2.1
[notice] To update, run: python3.11 -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Now let's pause and consider what is possible with what we have. Our df_united... data frame can pretty clearly be used to examine linear relationships among schizophrenia, depression, and alcoholism (in the United States).
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
print(df_united_inSad.head())
print(df_united_inSad.describe())
plt.figure(figsize=(10, 6))
plt.scatter(df_united_inSad['Year'], df_united_inSad['Schizo'], label='Schizo')
plt.scatter(df_united_inSad['Year'], df_united_inSad['Depp'], label='Depp')
plt.scatter(df_united_inSad['Year'], df_united_inSad['Alch'], label='Alch')
plt.xlabel('Year')
plt.ylabel('Value')
plt.legend()
plt.show()
#Now, let's run a linear regression for each of 'Schizo', 'Depp', and 'Alch' against 'Year'. We'll use the OLS function from statsmodels to do this:
# Add a constant to the DataFrame for the linear regression
df_united_inSad = sm.add_constant(df_united_inSad)
# Run a linear regression for 'Schizo' against 'Year'
model_schizo = sm.OLS(df_united_inSad['Schizo'], df_united_inSad[['const', 'Year']])
results_schizo = model_schizo.fit()
print(results_schizo.summary())
# Run a linear regression for 'Depp' against 'Year'
model_depp = sm.OLS(df_united_inSad['Depp'], df_united_inSad[['const', 'Year']])
results_depp = model_depp.fit()
print(results_depp.summary())
# Run a linear regression for 'Alch' against 'Year'
model_alch = sm.OLS(df_united_inSad['Alch'], df_united_inSad[['const', 'Year']])
results_alch = model_alch.fit()
print(results_alch.summary())
Entity Code Year Schizo Depp Alch
6330 United States USA 1990 0.467115 4.068695 3.111360
6331 United States USA 1991 0.472488 4.196610 3.022482
6332 United States USA 1992 0.477502 4.323224 2.937071
6333 United States USA 1993 0.481847 4.443956 2.858443
6334 United States USA 1994 0.485216 4.554388 2.790028
Year Schizo Depp Alch
count 30.000000 30.000000 30.000000 30.000000
mean 2004.500000 0.489256 4.755258 2.613989
std 8.803408 0.012945 0.225955 0.166118
min 1990.000000 0.467115 4.068695 2.481960
25% 1997.250000 0.480224 4.732868 2.514782
50% 2004.500000 0.490101 4.833537 2.555633
75% 2011.750000 0.501105 4.910971 2.603112
max 2019.000000 0.506018 4.933798 3.111360
OLS Regression Results
==============================================================================
Dep. Variable: Schizo R-squared: 0.000
Model: OLS Adj. R-squared: -0.036
Method: Least Squares F-statistic: 0.0001566
Date: Fri, 04 Aug 2023 Prob (F-statistic): 0.990
Time: 19:45:07 Log-Likelihood: 88.353
No. Observations: 30 AIC: -172.7
Df Residuals: 28 BIC: -169.9
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.4823 0.557 0.866 0.394 -0.659 1.623
Year 3.477e-06 0.000 0.013 0.990 -0.001 0.001
==============================================================================
Omnibus: 4.164 Durbin-Watson: 0.069
Prob(Omnibus): 0.125 Jarque-Bera (JB): 1.934
Skew: -0.296 Prob(JB): 0.380
Kurtosis: 1.907 Cond. No. 4.64e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.64e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS Regression Results
==============================================================================
Dep. Variable: Depp R-squared: 0.293
Model: OLS Adj. R-squared: 0.268
Method: Least Squares F-statistic: 11.61
Date: Fri, 04 Aug 2023 Prob (F-statistic): 0.00200
Time: 19:45:07 Log-Likelihood: 7.7676
No. Observations: 30 AIC: -11.54
Df Residuals: 28 BIC: -8.733
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -23.1025 8.174 -2.826 0.009 -39.847 -6.358
Year 0.0139 0.004 3.408 0.002 0.006 0.022
==============================================================================
Omnibus: 3.922 Durbin-Watson: 0.074
Prob(Omnibus): 0.141 Jarque-Bera (JB): 3.446
Skew: -0.815 Prob(JB): 0.179
Kurtosis: 2.685 Cond. No. 4.64e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.64e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
OLS Regression Results
==============================================================================
Dep. Variable: Alch R-squared: 0.556
Model: OLS Adj. R-squared: 0.540
Method: Least Squares F-statistic: 35.04
Date: Fri, 04 Aug 2023 Prob (F-statistic): 2.27e-06
Time: 19:45:07 Log-Likelihood: 23.966
No. Observations: 30 AIC: -43.93
Df Residuals: 28 BIC: -41.13
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 30.8142 4.764 6.468 0.000 21.056 40.573
Year -0.0141 0.002 -5.920 0.000 -0.019 -0.009
==============================================================================
Omnibus: 2.070 Durbin-Watson: 0.091
Prob(Omnibus): 0.355 Jarque-Bera (JB): 1.012
Skew: 0.401 Prob(JB): 0.603
Kurtosis: 3.407 Cond. No. 4.64e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.64e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Pickle and DataFrames¶
One nice thing about CSV files is that they can be readily shared. How about Python objects? We have two options for saving our dataframes: we can use methods that ship with pandas, or we can use the pickle library directly. The disadvantage of the pickle library approach here is that it saves the dataframes as a list, so to avoid extra work we use the built-in method.
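As a small round-trip sketch of the built-in method (toy values; the file name is arbitrary):

```python
import pandas as pd

# Toy frame standing in for one of our dataframes (values made up)
df = pd.DataFrame({'Year': [1990, 1991], 'Schizo': [0.467115, 0.472488]})

# Built-in pandas method: writes exactly one DataFrame per file
df.to_pickle('demo_roundtrip.pkl')

# Reading it back returns a DataFrame directly -- no list to unpack
df_back = pd.read_pickle('demo_roundtrip.pkl')
print(df.equals(df_back))  # True
```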
Pickling the United States Dataframe¶
#import pickle
# Assuming dfs is your list of DataFrames
#with open('dataframes_sui.pkl', 'wb') as f:
#pickle.dump(df_us_new, f)
#df_us_new.to_pickle("df_new_ca_math8.pkl")
!ls
Bringing DataFrames Together.ipynb Crimes_-_2001_to_Present.csv Data_Analysis_Notebook_HAHA_pleasebehere.ipynb Education Data United States Exploration.ipynb Gallup_rel.csv JavaScript Data Analysis Similar Project 2023.ipynb Mental Health Data Exploration .ipynb Religious Data Exploration Process.ipynb Stocks Data .ipynb Suicides In US .ipynb Tech Stocks, MH and Suicide Rates.ipynb US Religion Census Top 0 Religious Groups (2020) (Downloaded from thearda.com on Sun Jul 30, 2023).csv US Schizophrenia, Stock Prices and .ipynb df_stocks.pkl mentalData.jpeg merged_df_mentalhealthde2023-08-04 18:53:36.pickle prevalence-by-mental-and-substance-use-disorder.csv religious.csv states_edu.csv stock_data.csv us_sui_rates.csv
#import pandas as pd
#pickled_dfs = ['dataframes_adj_religion.pkl', 'dataframes_edu.pkl', 'dataframes_new_ca_math.pkl', 'dataframes_stocks.pkl', 'dataframes_sui.pkl']
#dataframes = []
# Loop through the list of pickle files
#for file in pickled_dfs:
# Load the DataFrame from the pickle file and append it to the list
# df = pd.read_pickle(file)
# dataframes.append(df)
#print(dataframes)
!ls|grep ipyn
Bringing DataFrames Together.ipynb Data_Analysis_Notebook_HAHA_pleasebehere.ipynb Education Data United States Exploration.ipynb JavaScript Data Analysis Similar Project 2023.ipynb Mental Health Data Exploration .ipynb Religious Data Exploration Process.ipynb Stocks Data .ipynb Suicides In US .ipynb Tech Stocks, MH and Suicide Rates.ipynb US Schizophrenia, Stock Prices and .ipynb
import glob
import pandas as pd

# Find all pickle files in the current directory
pickled_dfs = glob.glob('./*.pkl')
dataframes_from = []

def import_pkls_from_dir():
    # Loop through the list of pickle files
    for file in pickled_dfs:
        # Load the DataFrame from the pickle file and append it to the list
        df = pd.read_pickle(file)
        dataframes_from.append(df)
    print(dataframes_from)

import_pkls_from_dir()
print(dataframes_from)
[ Ticker Year Average Close 0 AAPL 2010 9.280088 1 AAPL 2011 13.000154 2 AAPL 2012 20.573204 3 AAPL 2013 16.879817 4 AAPL 2014 23.066164 5 AAPL 2015 30.010000 6 AAPL 2016 26.151002 7 AAPL 2017 37.637769 8 AAPL 2018 47.263357 9 AAPL 2019 52.063988 10 AAPL 2020 95.347075 11 AAPL 2021 140.989365 12 AAPL 2022 154.835060 13 AAPL 2023 165.059231 14 MSFT 2010 27.058571 15 MSFT 2011 26.052341 16 MSFT 2012 29.820960 17 MSFT 2013 32.491905 18 MSFT 2014 42.453492 19 MSFT 2015 46.714286 20 MSFT 2016 55.259325 21 MSFT 2017 71.984024 22 MSFT 2018 101.033984 23 MSFT 2019 130.382024 24 MSFT 2020 193.026126 25 MSFT 2021 275.940754 26 MSFT 2022 268.917091 27 MSFT 2023 291.917272 28 NVDA 2010 3.390794 29 NVDA 2011 4.221885 30 NVDA 2012 3.376970 31 NVDA 2013 3.547282 32 NVDA 2014 4.636766 33 NVDA 2015 5.929563 34 NVDA 2016 13.440298 35 NVDA 2017 37.445926 36 NVDA 2018 58.096036 37 NVDA 2019 43.646924 38 NVDA 2020 98.907085 39 NVDA 2021 195.219931 40 NVDA 2022 185.689562 41 NVDA 2023 297.112867] [ Ticker Year Average Close 0 AAPL 2010 9.280088 1 AAPL 2011 13.000154 2 AAPL 2012 20.573204 3 AAPL 2013 16.879817 4 AAPL 2014 23.066164 5 AAPL 2015 30.010000 6 AAPL 2016 26.151002 7 AAPL 2017 37.637769 8 AAPL 2018 47.263357 9 AAPL 2019 52.063988 10 AAPL 2020 95.347075 11 AAPL 2021 140.989365 12 AAPL 2022 154.835060 13 AAPL 2023 165.059231 14 MSFT 2010 27.058571 15 MSFT 2011 26.052341 16 MSFT 2012 29.820960 17 MSFT 2013 32.491905 18 MSFT 2014 42.453492 19 MSFT 2015 46.714286 20 MSFT 2016 55.259325 21 MSFT 2017 71.984024 22 MSFT 2018 101.033984 23 MSFT 2019 130.382024 24 MSFT 2020 193.026126 25 MSFT 2021 275.940754 26 MSFT 2022 268.917091 27 MSFT 2023 291.917272 28 NVDA 2010 3.390794 29 NVDA 2011 4.221885 30 NVDA 2012 3.376970 31 NVDA 2013 3.547282 32 NVDA 2014 4.636766 33 NVDA 2015 5.929563 34 NVDA 2016 13.440298 35 NVDA 2017 37.445926 36 NVDA 2018 58.096036 37 NVDA 2019 43.646924 38 NVDA 2020 98.907085 39 NVDA 2021 195.219931 40 NVDA 2022 185.689562 41 NVDA 2023 297.112867]
Bringing In A New Dataset¶
Now I would like to get a feel for another important aspect of data analysis. What we will do next, in sequence:
- perform "data engineering": merging datasets;
- perform "data analysis" by exploring our data after it's prepared, and possibly forming questions we might seek to answer or at least explore;
- perform machine learning if the engineered data is amenable to the corresponding methods/techniques.
The New Dataset¶
It is natural to wonder what relationships over time might hold between suicide rates and the mental disorders we've been exploring. We bring in a new dataset and use it to see if we might uncover expected (or unexpected) relationships.
import pandas as pd
df_sui_rates = pd.read_csv('us_sui_rates.csv')
print(df_sui_rates.head())
df_simps = df_sui_rates[['INDICATOR', 'YEAR', 'AGE_NUM','ESTIMATE']]
INDICATOR UNIT \
0 Death rates for suicide Deaths per 100,000 resident population, age-ad...
1 Death rates for suicide Deaths per 100,000 resident population, age-ad...
2 Death rates for suicide Deaths per 100,000 resident population, age-ad...
3 Death rates for suicide Deaths per 100,000 resident population, age-ad...
4 Death rates for suicide Deaths per 100,000 resident population, age-ad...
UNIT_NUM STUB_NAME STUB_NAME_NUM STUB_LABEL STUB_LABEL_NUM YEAR \
0 1 Total 0 All persons 0.0 1950
1 1 Total 0 All persons 0.0 1960
2 1 Total 0 All persons 0.0 1970
3 1 Total 0 All persons 0.0 1980
4 1 Total 0 All persons 0.0 1981
YEAR_NUM AGE AGE_NUM ESTIMATE FLAG
0 1 All ages 0.0 13.2 NaN
1 2 All ages 0.0 12.5 NaN
2 3 All ages 0.0 13.1 NaN
3 4 All ages 0.0 12.2 NaN
4 5 All ages 0.0 12.3 NaN
print(df_simps.head(10))
num_unique_values = df_simps['INDICATOR'].nunique()
print(num_unique_values)
# Identify columns to drop
cols_to_drop = [col for col in df_simps.columns if df_simps[col].nunique() == 1]
# Drop these columns
df_dsimps = df_simps.drop(cols_to_drop, axis=1)
print(df_dsimps.head())
INDICATOR YEAR AGE_NUM ESTIMATE
0 Death rates for suicide 1950 0.0 13.2
1 Death rates for suicide 1960 0.0 12.5
2 Death rates for suicide 1970 0.0 13.1
3 Death rates for suicide 1980 0.0 12.2
4 Death rates for suicide 1981 0.0 12.3
5 Death rates for suicide 1982 0.0 12.5
6 Death rates for suicide 1983 0.0 12.4
7 Death rates for suicide 1984 0.0 12.6
8 Death rates for suicide 1985 0.0 12.5
9 Death rates for suicide 1986 0.0 13.0
1
YEAR AGE_NUM ESTIMATE
0 1950 0.0 13.2
1 1960 0.0 12.5
2 1970 0.0 13.1
3 1980 0.0 12.2
4 1981 0.0 12.3
Before we try to merge data sets, let's consider what we have here. We see a column that could be said to be shared, at least in part, between this new dataframe and our schizophrenia dataframe. We can therefore, for one, plan on merging along shared years.
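One wrinkle worth seeing up front: the suicide data repeats each YEAR across demographic groups, so a direct merge would duplicate rows, which is why the merge cell further down aggregates by year first. A small sketch with toy values:

```python
import pandas as pd

# Toy stand-ins: 'left' mimics the one-row-per-year US frame (values made up)
left = pd.DataFrame({'Year': [1990, 1991], 'Schizo': [0.467115, 0.472488]})
# 'right' mimics the suicide data, where each YEAR appears once per group
right = pd.DataFrame({'YEAR': [1990, 1990, 1991, 1991],
                      'ESTIMATE': [12.5, 19.9, 12.3, 19.8]})

# A direct merge duplicates each left-hand row once per matching right-hand row
direct = left.merge(right, left_on='Year', right_on='YEAR', how='left')
print(len(direct))  # 4, not 2

# Aggregating first restores one row per year
agg = right.groupby('YEAR').mean().reset_index()
merged = left.merge(agg, left_on='Year', right_on='YEAR', how='left')
print(len(merged))  # 2
```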
df_sui_rates.columns
df_sui_rates.to_pickle('sui0804.pkl')
df_united_inSad.to_pickle('unisad.pkl')
!pwd
/Users/josechavez/WEB/jc_micro_projects/dataSnA/projectUno/notebooks
print(df_united_inSad.head(50))
const Entity Code Year Schizo Depp Alch
6330 1.0 United States USA 1990 0.467115 4.068695 3.111360
6331 1.0 United States USA 1991 0.472488 4.196610 3.022482
6332 1.0 United States USA 1992 0.477502 4.323224 2.937071
6333 1.0 United States USA 1993 0.481847 4.443956 2.858443
6334 1.0 United States USA 1994 0.485216 4.554388 2.790028
6335 1.0 United States USA 1995 0.487275 4.649917 2.735402
6336 1.0 United States USA 1996 0.488420 4.728877 2.680183
6337 1.0 United States USA 1997 0.489322 4.792297 2.616027
6338 1.0 United States USA 1998 0.490074 4.841765 2.555207
6339 1.0 United States USA 1999 0.490764 4.877494 2.510123
6340 1.0 United States USA 2000 0.491473 4.899882 2.493354
6341 1.0 United States USA 2001 0.493311 4.910241 2.499927
6342 1.0 United States USA 2002 0.496664 4.912093 2.513923
6343 1.0 United States USA 2003 0.500509 4.909044 2.531130
6344 1.0 United States USA 2004 0.503822 4.906857 2.547328
6345 1.0 United States USA 2005 0.505568 4.911215 2.558304
6346 1.0 United States USA 2006 0.505987 4.919271 2.563083
6347 1.0 United States USA 2007 0.506018 4.926512 2.564368
6348 1.0 United States USA 2008 0.505684 4.931658 2.563379
6349 1.0 United States USA 2009 0.505018 4.933798 2.561372
6350 1.0 United States USA 2010 0.504039 4.932534 2.559565
6351 1.0 United States USA 2011 0.501304 4.914326 2.556059
6352 1.0 United States USA 2012 0.496241 4.873906 2.549004
6353 1.0 United States USA 2013 0.490129 4.825310 2.539524
6354 1.0 United States USA 2014 0.484211 4.780310 2.528673
6355 1.0 United States USA 2015 0.479682 4.751602 2.517357
6356 1.0 United States USA 2016 0.473109 4.740377 2.500322
6357 1.0 United States USA 2017 0.467995 4.735252 2.484274
6358 1.0 United States USA 2018 0.467770 4.732570 2.481960
6359 1.0 United States USA 2019 0.469107 4.733761 2.490441
print(df_dsimps.head(50))
YEAR AGE_NUM ESTIMATE
0 1950 0.0 13.2
1 1960 0.0 12.5
2 1970 0.0 13.1
3 1980 0.0 12.2
4 1981 0.0 12.3
5 1982 0.0 12.5
6 1983 0.0 12.4
7 1984 0.0 12.6
8 1985 0.0 12.5
9 1986 0.0 13.0
10 1987 0.0 12.8
11 1988 0.0 12.5
12 1989 0.0 12.3
13 1990 0.0 12.5
14 1991 0.0 12.3
15 1992 0.0 12.0
16 1993 0.0 12.1
17 1994 0.0 11.9
18 1995 0.0 11.8
19 1996 0.0 11.5
20 1997 0.0 11.2
21 1998 0.0 11.1
22 1999 0.0 10.5
23 2000 0.0 10.4
24 2001 0.0 10.7
25 2002 0.0 10.9
26 2003 0.0 10.8
27 2004 0.0 11.0
28 2005 0.0 10.9
29 2006 0.0 11.0
30 2007 0.0 11.3
31 2008 0.0 11.6
32 2009 0.0 11.8
33 2010 0.0 12.1
34 2011 0.0 12.3
35 2012 0.0 12.6
36 2013 0.0 12.6
37 2014 0.0 13.0
38 2015 0.0 13.3
39 2016 0.0 13.5
40 2017 0.0 14.0
41 2018 0.0 14.2
42 1950 0.0 21.2
43 1960 0.0 20.0
44 1970 0.0 19.8
45 1980 0.0 19.9
46 1981 0.0 19.8
47 1982 0.0 20.4
48 1983 0.0 20.4
49 1984 0.0 20.9
Merging Data sets¶
Storing a dataframe with simpler column names in a new dataframe.
Now let's try to merge the dataframes. Notice that both frames carry a year column, but under different names ('Year' vs. 'YEAR'), so we point merge at them with left_on and right_on.
print(df_united_inSad.head(20))
print(df_dsimps.head(20))
# Merge the dataframes (same as i had before)
#merged_df = df_united_inSad.merge(df_dsimps, left_on='Year', right_on='YEAR', how='left')
# Display the first few rows of the merged dataframe
#merged_df.head()
#merged_df = df_united_inSad.merge(df_dsimps, left_on='Year', right_on='YEAR', how='left')
# Aggregate df_dsimps by year, taking the mean of the other columns
df_dsimps_agg = df_dsimps.groupby('YEAR').mean().reset_index()
# Merge the dataframes again
merged_df_agg = df_united_inSad.merge(df_dsimps_agg, left_on='Year', right_on='YEAR', how='left')
# Display the first few rows of the new merged dataframe
merged_df_agg.head()
#I want to keep the merged_df name for now (with GPT's help in getting unique values) 2023-08-04 20:11:07
#print(merged_df.head(10))
merged_df=merged_df_agg
#display(merged_df)
const Entity Code Year Schizo Depp Alch
6330 1.0 United States USA 1990 0.467115 4.068695 3.111360
6331 1.0 United States USA 1991 0.472488 4.196610 3.022482
6332 1.0 United States USA 1992 0.477502 4.323224 2.937071
6333 1.0 United States USA 1993 0.481847 4.443956 2.858443
6334 1.0 United States USA 1994 0.485216 4.554388 2.790028
6335 1.0 United States USA 1995 0.487275 4.649917 2.735402
6336 1.0 United States USA 1996 0.488420 4.728877 2.680183
6337 1.0 United States USA 1997 0.489322 4.792297 2.616027
6338 1.0 United States USA 1998 0.490074 4.841765 2.555207
6339 1.0 United States USA 1999 0.490764 4.877494 2.510123
6340 1.0 United States USA 2000 0.491473 4.899882 2.493354
6341 1.0 United States USA 2001 0.493311 4.910241 2.499927
6342 1.0 United States USA 2002 0.496664 4.912093 2.513923
6343 1.0 United States USA 2003 0.500509 4.909044 2.531130
6344 1.0 United States USA 2004 0.503822 4.906857 2.547328
6345 1.0 United States USA 2005 0.505568 4.911215 2.558304
6346 1.0 United States USA 2006 0.505987 4.919271 2.563083
6347 1.0 United States USA 2007 0.506018 4.926512 2.564368
6348 1.0 United States USA 2008 0.505684 4.931658 2.563379
6349 1.0 United States USA 2009 0.505018 4.933798 2.561372
YEAR AGE_NUM ESTIMATE
0 1950 0.0 13.2
1 1960 0.0 12.5
2 1970 0.0 13.1
3 1980 0.0 12.2
4 1981 0.0 12.3
5 1982 0.0 12.5
6 1983 0.0 12.4
7 1984 0.0 12.6
8 1985 0.0 12.5
9 1986 0.0 13.0
10 1987 0.0 12.8
11 1988 0.0 12.5
12 1989 0.0 12.3
13 1990 0.0 12.5
14 1991 0.0 12.3
15 1992 0.0 12.0
16 1993 0.0 12.1
17 1994 0.0 11.9
18 1995 0.0 11.8
19 1996 0.0 11.5
Dropping the Entity¶
Since we already know (in context) that everything in this latest dataframe is related to United States data, we can go ahead and drop that column.
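A minimal sketch of dropping the redundant column (merged_small is a hypothetical stand-in for the merged frame, with made-up values):

```python
import pandas as pd

# Hypothetical stand-in for the merged frame; every row is United States data
merged_small = pd.DataFrame({
    'Entity': ['United States', 'United States'],
    'Year': [1990, 1991],
    'Schizo': [0.467115, 0.472488],
})

# drop returns a new frame; the original is left untouched
dropped = merged_small.drop(columns=['Entity'])
print(list(dropped.columns))  # ['Year', 'Schizo']
```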
Models to Try (Random Forest)¶
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
merged_df['ESTIMATE'] = merged_df['ESTIMATE'].fillna(merged_df['ESTIMATE'].mean())
# Assume that you want to predict 'ESTIMATE' based on ['Schizo', 'Depp', 'Alch', 'AGE_NUM']
X = merged_df[['Schizo', 'Depp', 'Alch', 'AGE_NUM']]
y = merged_df['ESTIMATE']
X = X.fillna(X.mean())
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
# Fit the model
model.fit(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
merged_df.to_pickle('merged_df_mentalhealthde2023-08-04 18:53:36.pickle')
Discussion on Random Forests¶
Let's take a look at what random forests are (roughly speaking) as well as how to evaluate them and what plots can tell us about our models' accuracy.
Random Forest Model: The Random Forest model is an ensemble learning method that operates by constructing multiple decision trees during training and outputting the mean prediction of the individual trees for regression tasks or the class that has the most votes for classification tasks. The main idea behind Random Forest is that a group of "weak learners" can come together to form a "strong learner". Each tree is trained on a random subset of the data and a random subset of the features, which helps to make the model robust to overfitting.
Model Evaluation Metrics: The Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) are all measures of the difference between the values predicted by the model and the actual values.
MAE is the average of the absolute differences between the predicted and actual values. It gives an idea of how wrong the predictions were. The measure gives an equal weight to all errors, whether they are big or small. In contrast to the MSE, the MAE needs no square root to be understood and its unit is the same as the original data.
MSE is similar to MAE but squares the differences before they are averaged. This means that larger errors are noted more than with MAE. This can be useful if you want to pay more attention to outliers.
RMSE is the square root of MSE. It is more sensitive to larger errors than MAE, but its interpretation is easier than MSE because its units are the same as the original data.
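The three metrics are simple enough to compute by hand. A small sketch with made-up numbers, checked against scikit-learn's implementations:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual = np.array([10.0, 12.0, 14.0])
predicted = np.array([11.0, 11.0, 17.0])

errors = predicted - actual          # [1, -1, 3]
mae = np.mean(np.abs(errors))        # (1 + 1 + 3) / 3 ≈ 1.667
mse = np.mean(errors ** 2)           # (1 + 1 + 9) / 3 ≈ 3.667
rmse = np.sqrt(mse)                  # ≈ 1.915

assert np.isclose(mae, mean_absolute_error(actual, predicted))
assert np.isclose(mse, mean_squared_error(actual, predicted))
```

Note how the single large error (3) pulls MSE up much more than MAE, which is exactly the outlier sensitivity described above.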
Scatter Plot: The scatter plot of actual vs predicted values is a graphical tool for comparing the model's predictions with the actual values. Each point on the plot represents a single data point (or instance). The x-coordinate of the point represents the actual value of the target variable for that instance, and the y-coordinate represents the value that the model predicted.
If the model's predictions are perfect, all points would lie on a straight diagonal line going from the bottom left to the top right of the plot. This is because for every instance, the predicted value would be equal to the actual value, placing the point along this line.
If the points are scattered far from this line, it means that there are instances where the model's predictions are far from the actual values. The further a point is from the line, the larger the error for that instance.
If the points seem to form a pattern that deviates from the diagonal line (for example, if they form a curve), it might indicate that the model is systematically mispredicting certain values. This could suggest that the model could be improved, for example by using a different type of model or by engineering new features.
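The diagonal described above can be drawn explicitly as a reference line, which makes deviations easier to judge by eye. A sketch with simulated predictions (not the notebook's data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated actual values and imperfect predictions (made-up data)
rng = np.random.default_rng(0)
actual = rng.uniform(0, 20, 50)
predicted = actual + rng.normal(0, 2, 50)

plt.scatter(actual, predicted)
# The y = x reference line: points on it are perfectly predicted
lims = [min(actual.min(), predicted.min()), max(actual.max(), predicted.max())]
plt.plot(lims, lims, 'r--', label='perfect prediction (y = x)')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.legend()
plt.show()
```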
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
# Calculate errors
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
print('Mean Absolute Error:', mae)
print('Mean Squared Error:', mse)
print('Root Mean Squared Error:', rmse)
Mean Absolute Error: 8.42745575493068
Mean Squared Error: 136.25288581868358
Root Mean Squared Error: 11.67274114416505
import matplotlib.pyplot as plt
# Scatter plot of actual vs predicted values
plt.scatter(y_test, predictions)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.show()
ARIMA Model¶
from statsmodels.tsa.arima.model import ARIMA
# Forecast 'ESTIMATE'; merged_df is the DataFrame built above
model = ARIMA(merged_df['ESTIMATE'], order=(5, 1, 0))
# order = (p, d, q):
#   p = number of autoregressive (AR) lags
#   d = order of differencing (integration)
#   q = number of moving-average (MA) lags
model_fit = model.fit()
print(model_fit.summary())
# plot residual errors
residuals = pd.DataFrame(model_fit.resid)
residuals.plot()
plt.show()
SARIMAX Results
==============================================================================
Dep. Variable: ESTIMATE No. Observations: 4597
Model: ARIMA(5, 1, 0) Log Likelihood -16728.300
Date: Fri, 04 Aug 2023 AIC 33468.599
Time: 19:45:08 BIC 33507.197
Sample: 0 HQIC 33482.185
- 4597
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 -0.3687 0.011 -32.898 0.000 -0.391 -0.347
ar.L2 -0.1969 0.014 -14.382 0.000 -0.224 -0.170
ar.L3 -0.1193 0.018 -6.782 0.000 -0.154 -0.085
ar.L4 -0.2337 0.017 -13.845 0.000 -0.267 -0.201
ar.L5 -0.1393 0.020 -7.094 0.000 -0.178 -0.101
sigma2 84.8863 1.186 71.568 0.000 82.562 87.211
===================================================================================
Ljung-Box (L1) (Q): 1.15 Jarque-Bera (JB): 5814.97
Prob(Q): 0.28 Prob(JB): 0.00
Heteroskedasticity (H): 0.97 Skew: -0.21
Prob(H) (two-sided): 0.58 Kurtosis: 8.49
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
Let's examine the significance of each returned parameter.
Dep. Variable: This is the dependent variable that the model is trying to predict. Here, it's 'ESTIMATE'.
No. Observations: This is the number of observations (i.e., data points or rows in your DataFrame) that the model was trained on.
Model: This is the type of model that was used, along with the order of the AR (autoregressive), I (integrated), and MA (moving average) components. Here, it's an ARIMA model with order (5, 1, 0).
Log Likelihood: This is the log of the likelihood function, which is a measure of how well the model fits the data. The higher the log likelihood, the better the model fits the data.
AIC: This stands for Akaike Information Criterion. It's a measure that takes into account both the goodness of fit of the model and the complexity of the model. The lower the AIC, the better the model is considered to be.
BIC: This stands for Bayesian Information Criterion. Like AIC, it also takes into account the goodness of fit and the complexity of the model, but it penalizes models with more parameters more heavily than AIC. The lower the BIC, the better the model is considered to be.
HQIC: This stands for Hannan-Quinn Information Criterion. It's another measure like AIC and BIC, but with a different penalty for the number of parameters. The lower the HQIC, the better the model is considered to be.
Covariance Type: This indicates the method used to compute the covariance matrix of the parameters. Here it's 'opg', which stands for outer product of gradients.
coef: These are the coefficients of the AR terms in the model. Each coefficient represents the impact of a one-unit increase in the corresponding lagged value of the dependent variable on the current value of the dependent variable.
std err: These are the standard errors of the coefficients. The standard error is a measure of the statistical accuracy of an estimate.
z: These are the z-scores of the coefficients. The z-score is the coefficient divided by its standard error.
P>|z|: These are the p-values associated with the z-scores. The p-value is the probability of obtaining a result as extreme as the observed result, under the null hypothesis.
[0.025 0.975]: These are the 95% confidence intervals for the coefficients. The confidence interval is a range of values, derived from the standard error, that is likely to contain the true value of the coefficient.
sigma2: This is the variance of the residuals (i.e., the difference between the observed and predicted values).
Ljung-Box (L1) (Q): This is a test statistic for the null hypothesis that the residuals are independently distributed. If the p-value (Prob(Q)) is small, you can reject the null hypothesis and conclude that the residuals are not independently distributed.
Jarque-Bera (JB): This is a test statistic for the null hypothesis that the residuals are normally distributed. If the p-value (Prob(JB)) is small, you can reject the null hypothesis and conclude that the residuals are not normally distributed.
Heteroskedasticity (H): This is a test statistic for the null hypothesis that the variance of the residuals is constant (i.e., the residuals are homoskedastic). If the p-value (Prob(H)) is small, you can reject the null hypothesis and conclude that the residuals are heteroskedastic.
Skew: This is a measure of the asymmetry of the distribution of the residuals. A positive skew indicates that the residuals are skewed to the right, and a negative skew indicates that they are skewed to the left.
Kurtosis: This is a measure of the "tailedness" of the distribution of the residuals. A high kurtosis indicates that the distribution has heavy tails and a sharp peak, and a low kurtosis indicates that the distribution has light tails and a flat peak.
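As a sanity check, the three information criteria in the summary above can be reproduced from the log likelihood alone. With k = 6 estimated parameters (the five AR coefficients plus sigma2) and n = 4597 observations:

```python
import math

log_lik = -16728.300  # Log Likelihood from the summary
k, n = 6, 4597        # parameters (ar.L1..ar.L5 + sigma2), observations

aic = 2 * k - 2 * log_lik                            # ≈ 33468.6
bic = k * math.log(n) - 2 * log_lik                  # ≈ 33507.2
hqic = 2 * k * math.log(math.log(n)) - 2 * log_lik   # ≈ 33482.2
print(aic, bic, hqic)
```

All three match the summary to rounding, and the formulas make the penalties explicit: BIC charges log(n) ≈ 8.4 per parameter versus AIC's flat 2, which is why BIC penalizes extra parameters more heavily here.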
Summary and Next Steps¶
Recall that the column 'ESTIMATE' came from the data on suicide rates in the US. With that in mind, let's draw some interpretations of the work here and sketch future directions:
- Interpretations
- Future:
- merge another data set and explore further (our plan is to bring in stock data from the three big tech companies)